Version: December 2017
Essential Functions
Listed below are the essential functions of the hosted infrastructure that must exist for the service to continue running.
The priority of each function determines the time frame within which the service will become available again. This is noted later in the document.
- Servers (Server Roles include Domain Services, IIS, SQL, SFTP)
- Network Connectivity
- Data (SQL, attachments, logs)
- Application software
- Remote system administration
- Reporting services
- SMTP services
- Performance monitoring
Risks
Risks to the stability of the service include, but are not limited to:
- Power loss at hosted site
- Fire/water or other catastrophic issue
- Server failure
- Power supply failure
- Memory or disk error
- Corrupted software
- Incompatible security patch
- Security failure
- Malicious damage caused by software or unauthorised access
- Network Infrastructure failure
- Loss of internet connection at host site
- Network device failure
Preventative Measures
The following is a list of strategies that have been put in place for data protection.
- Local SAN storage
- Daily whole server backups
- Replication of data to an off-site location
- Data is backed up, encrypted and transferred to the SAN at the secondary data centre
- Blade technologies are being used across all servers to increase data reliability
- UPS and diesel generator power systems - The servers are not reliant on the local power grid to guarantee around-the-clock power. On-site diesel-powered generators and uninterruptible power systems (UPS) deliver redundant power if a critical incident occurs. This ensures all operations are uninterrupted and the dedicated servers remain online.
- Redundant climate control systems - The heating, ventilation and air conditioning (HVAC) systems have full particle filtering and humidity control. The climate within the data centre is maintained according to ASHRAE Guidelines. This ensures that the mission-critical dedicated servers and hardware are functioning at their best.
- Fire Suppression - VESDA detection with clean agent fire extinguishers
- The data centre is locked and guarded, and can only be accessed by authorised personnel.
- Monitored closed-circuit television and 24x7x365 on-site security teams protect the data centre, while military-grade pass card access and biometric finger scan units provide further security.
- The data centre is a multi-level low-rise building with a raised floor.
- 24x7x365 NOC support - The Network Operations Centre (NOC) provides 24x7x365 support. The NOC monitors the network, while engineers and data centre personnel keep the facilities running smoothly. Around-the-clock access to phone and online support is also available.
- The infrastructure is regularly tested to check performance and to confirm redundancy in the event of an emergency.
- Fail-over from principal to the mirrored database is regularly tested to ensure data redundancy in the event of an emergency.
- A vulnerability scanner is used to detect risks on all servers. All risks identified are addressed immediately.
- Server performance is monitored, which allows Point Progress to plan for future scalability.
- Event log maintenance is scheduled regularly to help detect intrusion.
- Applications are probed to ensure that they are responding 100% of the time.
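The application check in the last item above could be sketched as follows. This is a minimal illustration only, not the monitoring tool actually in use: the function names, the health-check URL and the five-second timeout are all assumptions.

```python
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP 2xx status within the timeout.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def availability(results):
    """Percentage of successful probes in a list of True/False results."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real application URL.
    print(probe("https://app.example.invalid/health"))
```

Running such a probe on a schedule and recording the results would give a simple availability percentage to compare against the target.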
Server fails to respond
Remotely power cycle the device by means of the APC Reboot Device provided by the hosting company. If the server still does not respond following a reboot, the issue is escalated to the hosting company for their usual support process.
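The procedure above amounts to a short decision flow, sketched here for illustration. The three callables are hypothetical placeholders for the real response check, the APC reboot action and the escalation to the hosting company; none of them is an actual API.

```python
def recover_unresponsive_server(is_responding, power_cycle, escalate):
    """Follow the procedure: check, power cycle once, then escalate.

    All three arguments are hypothetical callables supplied by the caller.
    A real implementation would also wait for the server to boot before
    re-checking; that delay is omitted here.
    """
    if is_responding():
        return "responding"
    power_cycle()
    if is_responding():
        return "recovered"
    escalate()
    return "escalated"
```

The returned status string can be logged so that each incident records which step resolved it.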
Database failure - application and data becoming unavailable
The principal database will be failed over to the mirrored database. The mirrored database has an exact replica of the data in the principal database.
The application's database connection will need to be updated to point to the mirrored database. In the event of a failover, users with open sessions on the application will lose their connection, but will be able to log back in with minimal data loss.
The database is mirrored every second. For example, if a session holds unsaved data at 10:00am and the database fails at 10:00am, that unsaved data cannot be recovered from the database replica.
This would involve a failover from the principal database to the mirrored and the application connection updated to the mirrored database. It would take approximately 30 minutes to restore service once the failure has been discovered.
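The connection update described above can be illustrated with a small sketch. The server names are hypothetical and the selection function is an assumption about how the switch might be expressed, not a description of the actual application configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseTargets:
    """Hypothetical pair of database servers used by the application."""
    principal: str
    mirror: str


def select_database(targets, principal_available):
    """Return the server the application should connect to.

    The principal is used while it is available; after a failover the
    application connection is pointed at the mirror instead.
    """
    return targets.principal if principal_available else targets.mirror


if __name__ == "__main__":
    targets = DatabaseTargets(principal="sql-principal", mirror="sql-mirror")
    print(select_database(targets, principal_available=False))
```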
Critical failure - server replaced
A replacement server is provided within the hour, and the SAN backup is applied to the replacement server.
If the mirrored database is still available, the data will be recovered in the same way the data is recovered for a failed database.
This would involve a replacement server, software installation (SQL), application installation and database restore. It would take approximately 4 hours to restore service.
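As a worked example of the approximate restore times quoted in the scenarios above, the sketch below estimates when service would return. The scenario keys are illustrative names, and measuring the 4 hours from the moment of discovery is an assumption; the text only states that explicitly for the 30-minute database failover.

```python
from datetime import datetime, timedelta

# Approximate figures taken from the scenarios above.
RESTORE_ESTIMATES = {
    "database_failure": timedelta(minutes=30),
    "critical_failure": timedelta(hours=4),
}


def estimated_restore_time(scenario, discovered_at):
    """Return the approximate time at which service is expected to resume."""
    return discovered_at + RESTORE_ESTIMATES[scenario]


if __name__ == "__main__":
    discovered = datetime(2017, 12, 1, 10, 0)
    for name in RESTORE_ESTIMATES:
        print(name, estimated_restore_time(name, discovered))
```

For a failure discovered at 10:00, this gives roughly 10:30 for a database failure and 14:00 for a critical server failure.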
Strategy
The MyExpensesOnline service is delivered on a series of virtual servers powered by VMware.
Primary Site
The core application and database functions are provided through a server farm located in Milton Keynes.
The physical infrastructure utilises an array of blade servers, allowing for hardware failure without interruption to the virtual servers above.
Data Recoverability
Point Progress perform nightly backups of all client data.
Complete system copies are taken every night and stored on the local SAN as well as on the secondary site SAN.
Redundancy is provided in the design of the server solution:
- Multiple application servers are utilised, separating application and database functions.
- An IIS farm is utilised to provide redundancy as well as load balancing.
- SQL Servers are used as the database infrastructure in a configuration providing excellent resilience.
A clone of the production infrastructure is in place to provide a staging point for testing prior to deployment to live.
All security patches are applied to the staging servers in advance of installation into production.
Secondary Site
In the case of catastrophic failure of the primary data centre, the system will fail over to a secondary site located in Reading, Berkshire.
Support
The infrastructure of the support systems has been de-centralised to ensure that in the case of catastrophic failure, the support personnel would be able to operate from alternative premises.
In the event of a loss of internet connectivity or other outage at the Point Progress office, emergency arrangements – if required – can be made to enable appropriate access to the server infrastructure during the downtime.
Any outage at the Point Progress offices will have no impact on the hosted service.